Correlation and Causation

  • A while back I told you about Simpson's paradox, and it was surprising how easy it was to draw a false conclusion from data.
  • Today I will give you some insight into a common mistake made when interpreting statistical data: confusing correlation with causation. I'll show you examples where data is correlated and why it is tempting to mistake that correlation for causation. Both words start with a C, and I very frequently read newspaper articles that deeply confuse the two, so let's dive in.

Screenshot taken from Udacity

Mortality

  • Suppose you are sick, and you wake up with a strong pain in the middle of the night. You are so sick that you fear you might die, but you're not too sick to apply the lessons of my Statistics 101 class and make a rational decision about whether to go to the hospital. In doing so, you consult the data.
  • You find that in your town, over the last year, 40 people were hospitalized, of whom 4 passed away. The vast majority of your town's population, 8,000 people, never went to the hospital, and of those, 20 passed away at home. So compute for me the percentage of people who died in the hospital and the percentage of people who died at home.

Screenshot taken from Udacity

Deciding

  • Now, this is a fictitious example, and the numbers are relatively large. What's important to notice is that the chance of dying in the hospital (10%) is 40 times as large as the chance of dying at home (0.25%).
  • That means whether you die or not is correlated with whether or not you are in a hospital.
  • So let me ask the critical question: given that you are a really smart statistics student, should you now stay at home? Should you resist the temptation to go to the hospital because it might indeed increase your chances of passing away?
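
As a quick check of the 40-fold figure, here is a minimal sketch. The variable names are illustrative, and the at-home population of 8,000 is an assumption taken from the later breakdown (40 sick plus 7,960 healthy people at home).

```python
# Counts from the example. The at-home total of 8,000 is an assumption
# derived from the later breakdown: 40 sick + 7,960 healthy.
hospital_total, hospital_deaths = 40, 4
home_total, home_deaths = 8000, 20

hospital_rate = hospital_deaths / hospital_total  # share of hospital patients who died
home_rate = home_deaths / home_total              # share of people at home who died
ratio = hospital_rate / home_rate                 # how many times larger the hospital rate is

print(f"Died in hospital: {hospital_rate:.2%}")
print(f"Died at home:     {home_rate:.2%}")
print(f"Ratio:            {ratio:.0f}x")
```

The ratio only describes how the two rates compare; it says nothing yet about why they differ.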

Answer

  • No. Based on the correlation data, it seems that being in a hospital makes you 40 times as likely to die as being at home, but that doesn't mean that by staying at home you reduce your chances of dying. The 40-times figure is a statement of correlation.

Screenshot taken from Udacity

Assuming Causation

  • "Being in a hospital, that fact alone, increases your probability of dying by a factor of 40" is a causal statement. It says the hospital causes you to die, not just that it coincides with the fact that you die. People in the public very frequently get this wrong.
  • People observe that there is a correlation, but they suggest the correlation is causal in an attempt to make you read the statistic as a call to action.

Considering Health

  • Let's say that of the 40 people in the hospital, 36 were actually sick and 4 of them passed away, while the other 4 were healthy and all survived.
  • Let's further assume that, of the people at home, 40 were indeed sick and 20 of them passed away, whereas the remaining 7,960 were healthy and also incurred a total of 20 deaths, perhaps because of accidents.
  • These statistics match the hospital figures I gave you before; at home, the deaths now add up to 40 rather than 20, but the argument is unaffected. We just added another variable: whether the person is sick or healthy.

Answer

  • Now, if we look at this, we realize that you are likely sick.
  • If you fall into the sick category, your chances of dying at home are 50%, and just about 11% in the hospital. So you should really go to the hospital very quickly.

Screenshot taken from Udacity
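
The comparison within each health category can be sketched as follows. The dictionary layout is hypothetical, and the counts assume that 4 of the 36 sick hospital patients and 20 of the 40 sick people at home died, which is what the quoted 11% and 50% imply.

```python
# (health, place) -> (people, deaths). Hypothetical layout. The sick counts
# are an assumption chosen to reproduce the 11% and 50% quoted in the text.
counts = {
    ("sick", "hospital"): (36, 4),
    ("sick", "home"): (40, 20),
    ("healthy", "hospital"): (4, 0),
    ("healthy", "home"): (7960, 20),
}

# Death rate inside each (health, place) cell.
rates = {key: deaths / people for key, (people, deaths) in counts.items()}

for (health, place), rate in rates.items():
    print(f"{health:8s} {place:9s} {rate:.2%}")
```

Once the comparison is made among sick people only, the hospital is the safer place, which is the opposite of what the unsplit numbers suggested.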

Correlation

  • Let's look in more detail at why the hospital example leads us to such a wrong conclusion.
  • We study two variables: being in the hospital and passing away. We rightfully observe that these two things are correlated. If we draw a scatter plot with two categories, whether or not a person was in the hospital and whether or not that person passed away, we find an increased occurrence of data in two of the cells relative to the other two. That means the data correlates.
  • What does correlation mean? In any plot, data is correlated if knowledge about one variable tells us something about the other.

Screenshot taken from Udacity

  • In the plot below, the data is not correlated, because no matter where I am in A, B seems to be the same.

Screenshot taken from Udacity

  • And now the data sits in a square in which it is uniformly arranged. Correlated, yes or no?

Answer

  • The answer is no. No matter where I am in A, the range for B is the same, as is the mean estimate.

Screenshot taken from Udacity
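
Here is a minimal sketch of the uniform square, assuming the points are drawn independently and uniformly from the unit square (a stand-in for the plot, not the data behind the screenshot). Pearson's r is used as one common measure of linear correlation; the helper `pearson` is written out here for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
a = [rng.uniform(0, 1) for _ in range(10_000)]
b = [rng.uniform(0, 1) for _ in range(10_000)]  # drawn independently of a

r = pearson(a, b)

# Mean of B for small A versus large A: both should sit near 0.5.
low = [y for x, y in zip(a, b) if x < 0.5]
high = [y for x, y in zip(a, b) if x >= 0.5]
mean_b_low = sum(low) / len(low)
mean_b_high = sum(high) / len(high)

print(f"r = {r:.3f}")
print(f"mean B when A < 0.5:  {mean_b_low:.3f}")
print(f"mean B when A >= 0.5: {mean_b_high:.3f}")
```

With r close to zero and the same mean of B on both sides, knowing A tells us nothing about B.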

  • Another data set: there's a boomerang shape over here. Correlated?

Answer

  • The answer is yes. Clearly, for different values of A, I get different values of B. It is not a linear correlation, yet it is still a correlation.

Screenshot taken from Udacity
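
To see how data can be related without being linearly related, here is a sketch under a loud assumption: the boomerang is approximated by a symmetric bent curve, B = A² plus a little noise. That shape is hypothetical and may differ from the one in the screenshot.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
a = [rng.uniform(-1, 1) for _ in range(5_000)]
b = [x * x + rng.gauss(0, 0.05) for x in a]  # hypothetical bent shape

r = pearson(a, b)  # measures only the linear part of the relationship

# Mean of B near the middle of A versus near the edges of A.
inner = [y for x, y in zip(a, b) if abs(x) < 0.5]
outer = [y for x, y in zip(a, b) if abs(x) >= 0.5]
mean_b_inner = sum(inner) / len(inner)
mean_b_outer = sum(outer) / len(outer)

print(f"r = {r:.3f}")
print(f"mean B for |A| < 0.5:  {mean_b_inner:.3f}")
print(f"mean B for |A| >= 0.5: {mean_b_outer:.3f}")
```

For this symmetric shape the linear coefficient stays near zero, yet the mean of B clearly depends on where you are in A, so knowledge of A still tells you something about B.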

Causation Structure

  • So clearly, in our example, whether or not you're in a hospital is correlated with whether or not you die. But the truth is that the example omitted an important variable: the sickness, the disease itself.
  • In fact, the sickness caused you to die, and it also affected your decision whether to go to a hospital.
  • So if you draw arcs of causation, you find that sickness causes death and sickness causes you to go to the hospital. If anything, once you knew you were sick, being in the hospital was negatively correlated with dying; that is, being in a hospital made it less likely for you to pass away, given that you were sick.
  • In statistics, we call this a confounding variable.
  • It's very tempting to just omit it from your data. If you do, you might find correlations (in this case, a positive correlation between the hospital and death) that have nothing to do with the way things are being caused, and as a result those correlations don't relate at all to what you should do.

Screenshot taken from Udacity
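
The causal structure described above can be sketched as a small simulation. All probabilities below are hypothetical, chosen only to mirror the arrows (sickness affects both hospital visits and death, and hospital care helps the sick); they are not the numbers from the example.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical probabilities, not taken from the lecture.
P_SICK = 0.10
P_HOSPITAL = {True: 0.50, False: 0.01}   # keyed by "is sick"
P_DEATH = {                              # keyed by (is sick, in hospital)
    (True, True): 0.10,
    (True, False): 0.50,
    (False, True): 0.003,
    (False, False): 0.003,
}

people = []
for _ in range(100_000):
    sick = rng.random() < P_SICK                        # the confounder
    in_hospital = rng.random() < P_HOSPITAL[sick]       # sickness -> hospital
    died = rng.random() < P_DEATH[(sick, in_hospital)]  # sickness (and care) -> death
    people.append((sick, in_hospital, died))

def death_rate(rows):
    """Fraction of the given (sick, in_hospital, died) rows that died."""
    return sum(died for _, _, died in rows) / len(rows)

# Ignoring sickness: compare everyone in hospital with everyone at home.
raw_hospital = death_rate([p for p in people if p[1]])
raw_home = death_rate([p for p in people if not p[1]])

# Holding sickness fixed: compare sick people in hospital with sick people at home.
sick_hospital = death_rate([p for p in people if p[0] and p[1]])
sick_home = death_rate([p for p in people if p[0] and not p[1]])

print(f"all people:  hospital {raw_hospital:.1%}  vs  home {raw_home:.1%}")
print(f"sick people: hospital {sick_hospital:.1%}  vs  home {sick_home:.1%}")
```

When the sickness variable is left out, the hospital looks dangerous; when it is held fixed, the hospital looks protective, even though the simulated hospital never harms anyone.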

Fire Correlation

  • Suppose you observe a number of different fires and you graph the number of firefighters against the size of the fire.
  • For the sake of argument, let's assume we studied four fires with 10, 40, 200, and 70 firefighters involved, and that the sizes of the fires, in terms of the surface area each fire occupied, were 100, 400, 2,000, and 700 respectively.
  • Putting this into a diagram, you get pretty much the following.
  • Plot the size of the fire against the number of firefighters and, as you've already learned, this looks very linear. So let me ask a question: is the number of firefighters correlated with the size of the fire?

Answer

  • Obviously it is, because there's a strong linear correlation.

Screenshot taken from Udacity
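
The strength of that linear relationship can be checked with a short sketch. The helper `pearson` is written out here for illustration; the four data points are the ones given in the example.

```python
firefighters = [10, 40, 200, 70]
fire_sizes = [100, 400, 2000, 700]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(firefighters, fire_sizes)
print(f"r = {r:.3f}")
```

Every fire size here is exactly ten times the number of firefighters, so the points lie on a straight line and r comes out as 1.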

  • Now the real question I'm getting at is, "Do firefighters cause fire?"
  • Or, more extremely, "If you get rid of all our firefighters, will you get rid of all the fires?" On the face of it, that is what the data seems to say.

Answer

  • And the answer is no. This is a case of what we call reverse causation.
  • You can argue that the size of the fire causes the number of firefighters that is being deployed, because the bigger the fire, the more firefighters the fire department will send.
  • Now our graph, which shows the correlation between these two variables, is oblivious to the direction of this arc. You could conclude that the size of the fire causes the number of firefighters. You could equally conclude that the number of firefighters causes the size of the fire.
  • In both cases you would use exactly the same data. But when I put it this way, it's pretty obvious that the right answer is that the size of the fire causes the number of firefighters to grow, and that conclusion does not come from the data itself.
  • It comes from the fact that we know something about fires and firefighters. It's impossible to deduce from this data that there is a causal relationship. It could be just coincidental, or the causal relationship could go either way.

Screenshot taken from Udacity
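
To make the point about direction concrete, here is a sketch that fits a least-squares line in both directions on the four fires. The helper `fit_line` is illustrative; ordinary least squares is used only as a convenient way to show that the data supports either reading equally well.

```python
firefighters = [10, 40, 200, 70]
fire_sizes = [100, 400, 2000, 700]

def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def max_residual(xs, ys):
    """Largest absolute error of the fitted line on the given points."""
    slope, intercept = fit_line(xs, ys)
    return max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))

# Reading 1: the number of firefighters "explains" the size of the fire.
size_from_firefighters = fit_line(firefighters, fire_sizes)
# Reading 2: the size of the fire "explains" the number of firefighters.
firefighters_from_size = fit_line(fire_sizes, firefighters)

print("size = %.2f * firefighters + %.2f" % size_from_firefighters)
print("firefighters = %.2f * size + %.2f" % firefighters_from_size)
print("worst error, reading 1:", max_residual(firefighters, fire_sizes))
print("worst error, reading 2:", max_residual(fire_sizes, firefighters))
```

Both fits describe the four points essentially perfectly, so nothing in the numbers prefers one direction over the other; choosing between them takes knowledge from outside the data.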

